Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning
- URL: http://arxiv.org/abs/2408.08959v1
- Date: Fri, 16 Aug 2024 18:07:48 GMT
- Title: Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning
- Authors: Jinwei Hu, Yi Dong, Xiaowei Huang,
- Abstract summary: Guardrails have become an integral part of Large language models (LLMs)
This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning.
By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user's credibility.
- Score: 9.719986610417441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Guardrails have become an integral part of Large language models (LLMs), by moderating harmful or toxic response in order to maintain LLMs' alignment to human expectations. However, the existing guardrail methods do not consider different needs and access rights of individual users, and treat all the users with the same rule. This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning, to dynamically modulate access to sensitive content based on user trust metrics. By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user's credibility and the specific context of their inquiries. Our empirical evaluations demonstrate that the adaptive guardrail effectively meets diverse user needs, outperforming existing guardrails in practicality while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. This work is the first to introduce trust-oriented concept within a guardrail system, offering a scalable solution that enriches the discourse on ethical deployment for next-generation LLMs.
Related papers
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF)
representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of LLM's intermediate hidden states.
It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv Detail & Related papers (2024-11-04T08:36:03Z) - CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants [5.7605009639020315]
Assessment of ten leading models across five scenarios (each with 337 use cases)
Key failure modes include inappropriate weighing of conflicting preferences, sycophancy, a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
We propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants.
arXiv Detail & Related papers (2024-10-28T15:59:31Z) - Trustworthy AI: Securing Sensitive Data in Large Language Models [0.0]
Large Language Models (LLMs) have transformed natural language processing (NLP) by enabling robust text generation and understanding.
This paper proposes a comprehensive framework for embedding trust mechanisms into LLMs to dynamically control the disclosure of sensitive information.
arXiv Detail & Related papers (2024-09-26T19:02:33Z) - Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z) - Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations [48.07182711678573]
ASTrust generates explanations grounded in the relationship between model confidence and syntactic structures of programming languages.
We develop an automated visualization that illustrates the aggregated model confidence scores superimposed on sequence, heat-map, and graph-based visuals of syntactic structures from ASTs.
arXiv Detail & Related papers (2024-07-12T04:38:28Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv Detail & Related papers (2024-07-06T07:19:30Z) - Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z) - RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations.
We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv Detail & Related papers (2023-11-28T14:55:52Z) - Intent Contrastive Learning for Sequential Recommendation [86.54439927038968]
We introduce a latent variable to represent users' intents and learn the distribution function of the latent variable via clustering.
We propose to leverage the learned intents into SR models via contrastive SSL, which maximizes the agreement between a view of sequence and its corresponding intent.
Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm.
arXiv Detail & Related papers (2022-02-05T09:24:13Z) - Personalized multi-faceted trust modeling to determine trust links in
social media and its potential for misinformation management [61.88858330222619]
We present an approach for predicting trust links between peers in social media.
We propose a data-driven multi-faceted trust modeling which incorporates many distinct features for a comprehensive analysis.
Illustrated in a trust-aware item recommendation task, we evaluate the proposed framework in the context of a large Yelp dataset.
arXiv Detail & Related papers (2021-11-11T19:40:51Z) - RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv Detail & Related papers (2021-07-07T15:42:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.