Trust-Oriented Adaptive Guardrails for Large Language Models
 - URL: http://arxiv.org/abs/2408.08959v2
 - Date: Mon, 03 Feb 2025 16:03:18 GMT
 - Title: Trust-Oriented Adaptive Guardrails for Large Language Models
 - Authors: Jinwei Hu, Yi Dong, Xiaowei Huang
 - Abstract summary: Guardrails are designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups. We introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics.
 - Score: 9.719986610417441
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Guardrails, an emerging mechanism designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses, require a sociotechnical approach to their design. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups, particularly concerning access rights. Supported by trust modeling (primarily the `social' aspect) and enhanced with online in-context learning via retrieval-augmented generation (the `technical' aspect), we introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics. User trust metrics, defined as a novel combination of direct interaction trust and authority-verified trust, enable the system to precisely tailor the strictness of content moderation by aligning with the user's credibility and the specific context of their inquiries. Our empirical evaluation demonstrates the effectiveness of the adaptive guardrail in meeting diverse user needs, outperforming existing guardrails while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. To the best of our knowledge, this work is the first to introduce a trust-oriented concept into a guardrail system, offering a scalable solution that enriches the discourse on the ethical deployment of next-generation LLM services.
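The abstract describes a user trust metric built from direct interaction trust and authority-verified trust, which then sets how strictly content is moderated for a given query. The sketch below illustrates that idea; it is a minimal assumption-laden illustration, not the paper's implementation: the equal-weight blend, the thresholds, and all names (`UserTrust`, `moderate`, `alpha`, `margin`) are made up for exposition, and `query_sensitivity` stands in for whatever hazard level the paper derives from its context-aware knowledge base.

```python
# Illustrative sketch of a trust-oriented adaptive guardrail.
# All weights, thresholds, and names are assumptions for exposition,
# not the mechanism described in the paper.
from dataclasses import dataclass


@dataclass
class UserTrust:
    interaction_trust: float  # direct interaction trust, assumed in [0, 1]
    authority_trust: float    # authority-verified trust, assumed in [0, 1]

    def combined(self, alpha: float = 0.5) -> float:
        """Blend the two trust sources into one user trust metric (assumed weighting)."""
        return alpha * self.interaction_trust + (1.0 - alpha) * self.authority_trust


def moderate(query_sensitivity: float, trust: UserTrust, margin: float = 0.1) -> str:
    """Map user trust vs. query sensitivity to a moderation decision.

    query_sensitivity in [0, 1] stands in for the hazard level a context-aware
    knowledge base might assign to the topic; the thresholds are illustrative.
    """
    score = trust.combined()
    if score >= query_sensitivity + margin:
        return "allow"            # user trust comfortably exceeds the topic's sensitivity
    if score >= query_sensitivity - margin:
        return "allow_redacted"   # borderline: answer, but withhold sensitive details
    return "refuse"               # trust is too low for this content


if __name__ == "__main__":
    user = UserTrust(interaction_trust=0.7, authority_trust=0.9)
    print(moderate(query_sensitivity=0.6, trust=user))  # prints "allow" (0.8 >= 0.7)
```

A real deployment in the spirit of the paper would presumably update the interaction-trust term from the user's interaction history and calibrate sensitivity scores against the retrieved knowledge base, both of which are beyond this sketch.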
 
       
      
        Related papers
- Six Guidelines for Trustworthy, Ethical and Responsible Automation Design [0.6144680854063939]
Calibrated trust in automated systems is critical for their safe and seamless integration into society. We propose six design guidelines to help designers optimize for accurate trustworthiness assessments.
arXiv  Detail & Related papers  (2025-08-04T13:01:09Z) - LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale. We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv  Detail & Related papers  (2025-05-29T21:39:08Z) - Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery [17.23286832909591]
We propose a set of ten challenges in transmission of knowledge derived from the philosophical literature.
We find users develop workarounds to address each of the challenges.
For AI developers, the Epistemic Alignment Framework offers concrete guidance for supporting diverse approaches to knowledge.
arXiv  Detail & Related papers  (2025-04-01T21:38:12Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models.
REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values.
We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv  Detail & Related papers  (2025-03-20T07:54:35Z) - On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [333.9220561243189]
Generative Foundation Models (GenFMs) have emerged as transformative tools.
Their widespread adoption raises critical concerns regarding trustworthiness across dimensions.
This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv  Detail & Related papers  (2025-02-20T06:20:36Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv  Detail & Related papers  (2024-11-21T09:46:55Z) - A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection [0.0]
Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope.
Current guardrails suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production.
This paper introduces a flexible, data-free guardrail development methodology that addresses these challenges.
arXiv  Detail & Related papers  (2024-11-20T00:31:23Z) - Unveiling User Preferences: A Knowledge Graph and LLM-Driven Approach for Conversational Recommendation [55.5687800992432]
We propose a plug-and-play framework that synergizes Large Language Models (LLMs) and Knowledge Graphs (KGs) to unveil user preferences.
This enables the LLM to transform KG entities into concise natural language descriptions, allowing them to comprehend domain-specific knowledge.
arXiv  Detail & Related papers  (2024-11-16T11:47:21Z) - Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF).
Representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of LLM's intermediate hidden states.
It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv  Detail & Related papers  (2024-11-04T08:36:03Z) - CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants [5.7605009639020315]
We assess ten leading models across five scenarios (each with 337 use cases).
Key failure modes include inappropriate weighing of conflicting preferences, sycophancy, a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
We propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants.
arXiv  Detail & Related papers  (2024-10-28T15:59:31Z) - Trustworthy AI: Securing Sensitive Data in Large Language Models [0.0]
Large Language Models (LLMs) have transformed natural language processing (NLP) by enabling robust text generation and understanding.
This paper proposes a comprehensive framework for embedding trust mechanisms into LLMs to dynamically control the disclosure of sensitive information.
arXiv  Detail & Related papers  (2024-09-26T19:02:33Z) - Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv  Detail & Related papers  (2024-09-05T09:10:38Z) - Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations [48.07182711678573]
ASTrust generates explanations grounded in the relationship between model confidence and syntactic structures of programming languages.
We develop an automated visualization that illustrates the aggregated model confidence scores superimposed on sequence, heat-map, and graph-based visuals of syntactic structures from ASTs.
arXiv  Detail & Related papers  (2024-07-12T04:38:28Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv  Detail & Related papers  (2024-07-06T07:19:30Z) - RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations.
We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv  Detail & Related papers  (2023-11-28T14:55:52Z) - A Systematic Literature Review of User Trust in AI-Enabled Systems: An HCI Perspective [0.0]
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element to fostering adoption.
This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies.
arXiv  Detail & Related papers  (2023-04-18T07:58:09Z) - Designing for Responsible Trust in AI Systems: A Communication Perspective [56.80107647520364]
We draw from communication theories and literature on trust in technologies to develop a conceptual model called MATCH.
We highlight transparency and interaction as AI systems' affordances that present a wide range of trustworthiness cues to users.
We propose a checklist of requirements to help technology creators identify appropriate cues to use.
arXiv  Detail & Related papers  (2022-04-29T00:14:33Z) - Intent Contrastive Learning for Sequential Recommendation [86.54439927038968]
We introduce a latent variable to represent users' intents and learn the distribution function of the latent variable via clustering.
We propose to leverage the learned intents into SR models via contrastive SSL, which maximizes the agreement between a view of sequence and its corresponding intent.
Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm.
arXiv  Detail & Related papers  (2022-02-05T09:24:13Z) - Personalized multi-faceted trust modeling to determine trust links in social media and its potential for misinformation management [61.88858330222619]
We present an approach for predicting trust links between peers in social media.
We propose a data-driven multi-faceted trust modeling which incorporates many distinct features for a comprehensive analysis.
We evaluate the proposed framework on a trust-aware item recommendation task in the context of a large Yelp dataset.
arXiv  Detail & Related papers  (2021-11-11T19:40:51Z) - RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv  Detail & Related papers  (2021-07-07T15:42:49Z) 