CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
- URL: http://arxiv.org/abs/2410.21159v1
- Date: Mon, 28 Oct 2024 15:59:31 GMT
- Title: CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
- Authors: Lize Alberts, Benjamin Ellis, Andrei Lupu, Jakob Foerster
- Abstract summary: An assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration.
Key failure modes include inappropriate weighing of conflicting preferences, sycophancy, a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
We propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants.
- Score: 5.7605009639020315
- Abstract: We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.
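A minimal sketch of how one such use case might be scored, assuming a generic chat-completion API: the user states a safety-critical constraint early in the conversation, a later request invites a recommendation that could violate it, and the reply is flagged if it recommends the unsafe item. The `query_model` stub, message layout, and keyword check are illustrative assumptions rather than the CURATe harness itself.

```python
# Illustrative sketch (not the CURATe implementation): score whether an
# assistant's recommendation respects a safety-critical fact the user
# shared earlier in the conversation.

def query_model(messages):
    """Placeholder for a call to any chat-completion API."""
    raise NotImplementedError

def evaluate_case(safety_context, later_request, unsafe_terms):
    """Return True if the reply avoids all terms that would harm the user."""
    messages = [
        {"role": "user", "content": safety_context},      # e.g. "I have a severe peanut allergy."
        {"role": "assistant", "content": "Noted, I'll keep that in mind."},
        {"role": "user", "content": later_request},        # e.g. "Suggest a dessert for my party."
    ]
    reply = query_model(messages).lower()
    return not any(term in reply for term in unsafe_terms)

# Example use case in the spirit of the benchmark:
# evaluate_case("I have a severe peanut allergy.",
#               "What dessert should I bring to the office party?",
#               unsafe_terms=["peanut"])
```

In the benchmark's terms, a failure here corresponds to the assistant overlooking safety-critical context supplied earlier in the conversation.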
Related papers
- On the loss of context-awareness in general instruction fine-tuning [101.03941308894191]
Post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs can harm existing capabilities learned during pretraining.
We propose two methods to mitigate the loss of context awareness in instruct models: post-hoc attention steering on user prompts and conditional instruction fine-tuning with a context-dependency indicator.
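As a rough illustration of the second mitigation, a conditional fine-tuning setup could tag training examples whose responses depend on the surrounding context, so the model learns when to attend to it. The tag string and record layout below are assumptions for illustration, not the paper's actual recipe.

```python
# Hypothetical data-formatting step for conditional instruction fine-tuning:
# prepend an explicit indicator to examples whose response depends on context.

CONTEXT_DEPENDENT_TAG = "<context-dependent>"  # assumed marker, not from the paper

def format_example(context, instruction, response, depends_on_context):
    prefix = CONTEXT_DEPENDENT_TAG + " " if depends_on_context else ""
    prompt = f"{prefix}Context: {context}\nInstruction: {instruction}"
    return {"prompt": prompt, "response": response}

example = format_example(
    context="The user is allergic to penicillin.",
    instruction="Which antibiotic should I ask my doctor about?",
    response="Given the allergy noted above, discuss non-penicillin options with your doctor.",
    depends_on_context=True,
)
```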
arXiv Detail & Related papers (2024-11-05T00:16:01Z) - Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI [2.99664686845172]
We introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection.
We showcase the system's adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience.
Our research underscores MIRA's ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.
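A toy sketch of the general idea of user-calibrated rule sets follows; the features, thresholds, and gesture labels are invented for illustration, whereas MIRA's actual rules operate on radar signal characteristics.

```python
# Toy illustration of a personalised rule-based classifier: the same
# transparent rules, with thresholds calibrated per user.

DEFAULT_RULES = {"swipe_min_velocity": 0.8, "push_max_duration": 0.4}  # invented values

def calibrate(rules, user_samples):
    """Shift the swipe threshold toward the median of a user's recorded gestures."""
    calibrated = dict(rules)
    velocities = sorted(s["velocity"] for s in user_samples)
    calibrated["swipe_min_velocity"] = 0.9 * velocities[len(velocities) // 2]
    return calibrated

def classify(sample, rules):
    if sample["velocity"] >= rules["swipe_min_velocity"]:
        return "swipe"
    if sample["duration"] <= rules["push_max_duration"]:
        return "push"
    return "unknown"
```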
arXiv Detail & Related papers (2024-09-30T16:40:27Z) - To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems [11.690126756498223]
The vision of optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems.
In practice, the performance disparity of machine learning models on out-of-distribution data makes dataset-specific performance feedback unreliable.
arXiv Detail & Related papers (2024-09-22T09:43:27Z) - Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
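A hedged sketch of the contrastive idea follows, with an assumed scoring rule and a placeholder likelihood helper rather than Con-ReCall's exact formulation: compare how much prefixing the target with known member text versus known non-member text shifts its log-likelihood.

```python
# Illustrative contrastive membership-inference score: texts whose likelihood
# is boosted more by a member-context prefix than by a non-member-context
# prefix are scored as more likely to appear in the pre-training data.

def log_likelihood(model, prefix, target):
    """Placeholder: sum of token log-probs of `target` given `prefix`."""
    raise NotImplementedError

def contrastive_score(model, target, member_context, nonmember_context):
    base = log_likelihood(model, "", target)
    with_member = log_likelihood(model, member_context, target)
    with_nonmember = log_likelihood(model, nonmember_context, target)
    # Assumed scoring rule: difference of the shifts induced by the two contexts.
    return (with_member - base) - (with_nonmember - base)
```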
arXiv Detail & Related papers (2024-09-05T09:10:38Z) - Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning [9.719986610417441]
Guardrails have become an integral part of large language models (LLMs).
This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning.
By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user's credibility.
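A toy sketch of how two such trust signals might be combined into a moderation threshold; the weighting scheme and threshold range are illustrative assumptions, not the paper's mechanism.

```python
# Toy sketch: combine two trust signals into a moderation threshold.

def moderation_threshold(interaction_trust, authority_trust,
                         weight=0.6, strictest=0.2, laxest=0.8):
    """Map combined trust in [0, 1] to a permissiveness threshold."""
    combined = weight * interaction_trust + (1 - weight) * authority_trust
    return strictest + combined * (laxest - strictest)

def allow_response(risk_score, interaction_trust, authority_trust):
    """Allow content whose estimated risk falls below the user's threshold."""
    return risk_score < moderation_threshold(interaction_trust, authority_trust)
```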
arXiv Detail & Related papers (2024-08-16T18:07:48Z) - Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
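A hypothetical sketch of that loop, using a placeholder `generate` call in place of the 13B Llama2-Chat model and prompt wording invented for illustration:

```python
# Hypothetical sketch of the self-reference feedback loop: answer first,
# critique the candidate answers against that reference, then pick a winner.

def generate(model, prompt):
    """Placeholder for a call to the feedback model."""
    raise NotImplementedError

def self_reference_preference(model, instruction, answer_a, answer_b):
    reference = generate(model, instruction)  # the model's own answer as reference
    critique = generate(
        model,
        f"Instruction: {instruction}\nReference answer: {reference}\n"
        f"Criticise answer A and answer B relative to the reference.\n"
        f"A: {answer_a}\nB: {answer_b}",
    )
    verdict = generate(
        model,
        f"{critique}\nBased on this critique, reply with 'A' or 'B' "
        f"for the answer that better fits human preferences.",
    )
    return verdict.strip()
```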
arXiv Detail & Related papers (2024-06-17T03:51:46Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
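One way to tabulate such gaps is to group verdicts on semantically related scenario variants by safety domain and report the within-domain accuracy spread; the sketch below is an illustrative tally, not ASSERT's pipeline.

```python
# Illustrative per-domain robustness tally: accuracy spread across
# semantically related variants of the same scenario, per safety domain.

from collections import defaultdict

def domain_accuracy_spread(results):
    """`results`: iterable of (domain, variant_id, correct) tuples."""
    by_variant = defaultdict(lambda: defaultdict(list))
    for domain, variant_id, correct in results:
        by_variant[domain][variant_id].append(correct)
    spread = {}
    for domain, variants in by_variant.items():
        accs = [sum(v) / len(v) for v in variants.values()]
        spread[domain] = max(accs) - min(accs)  # gap across related variants
    return spread
```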
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - RAH! RecSys-Assistant-Human: A Human-Centered Recommendation Framework with LLM Agents [30.250555783628762]
This research argues that addressing these issues is not solely the recommender systems' responsibility.
We introduce the RAH (Recommender system, Assistant, and Human) framework, emphasizing alignment with user personalities.
Our contributions provide a human-centered recommendation framework that partners effectively with various recommendation models.
arXiv Detail & Related papers (2023-08-19T04:46:01Z) - A Systematic Literature Review of User Trust in AI-Enabled Systems: An HCI Perspective [0.0]
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element in fostering adoption.
This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies.
arXiv Detail & Related papers (2023-04-18T07:58:09Z) - Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space.
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation.
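A simplified consistency objective in the same spirit is sketched below: a KL-based agreement term between clean and adversarial predictions in the output space, rather than the paper's contrastive loss; the shapes and temperature are illustrative.

```python
# Rough sketch of an output-space agreement loss between clean and
# adversarial views of the same image (not the ASSUDA objective verbatim).

import torch
import torch.nn.functional as F

def agreement_loss(logits_clean, logits_adv, temperature=0.5):
    """Encourage per-pixel predictions on clean and adversarial inputs to agree.

    logits_*: tensors of shape (batch, classes, height, width).
    """
    p_clean = F.softmax(logits_clean / temperature, dim=1)
    log_p_adv = F.log_softmax(logits_adv / temperature, dim=1)
    # KL(clean || adversarial), averaged over the batch.
    return F.kl_div(log_p_adv, p_clean, reduction="batchmean")
```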
arXiv Detail & Related papers (2021-05-23T01:50:44Z)