Trust-Oriented Adaptive Guardrails for Large Language Models
 - URL: http://arxiv.org/abs/2408.08959v2
 - Date: Mon, 03 Feb 2025 16:03:18 GMT
 - Title: Trust-Oriented Adaptive Guardrails for Large Language Models
 - Authors: Jinwei Hu, Yi Dong, Xiaowei Huang
 - Abstract summary: Guardrails are designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups. We introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics.
 - Score: 9.719986610417441
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Guardrails, an emerging mechanism designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses, require a sociotechnical approach to their design. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups, particularly concerning access rights. Supported by trust modeling (primarily the `social' aspect) and enhanced with online in-context learning via retrieval-augmented generation (the `technical' aspect), we introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics. User trust metrics, defined as a novel combination of direct interaction trust and authority-verified trust, enable the system to precisely tailor the strictness of content moderation by aligning with the user's credibility and the specific context of their inquiries. Our empirical evaluation demonstrates the effectiveness of the adaptive guardrail in meeting diverse user needs, outperforming existing guardrails while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. To the best of our knowledge, this work is the first to introduce a trust-oriented concept into a guardrail system, offering a scalable solution that enriches the discourse on the ethical deployment of next-generation LLM services.
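The abstract describes a user trust metric built from direct interaction trust and authority-verified trust, which then sets how strictly content is moderated for a given query. The sketch below illustrates that idea; it is a minimal assumption-laden illustration, not the paper's implementation: the equal-weight blend, the thresholds, and all names (`UserTrust`, `moderate`, `alpha`, `margin`) are made up for exposition, and `query_sensitivity` stands in for whatever hazard level the paper derives from its context-aware knowledge base.

```python
# Illustrative sketch of a trust-oriented adaptive guardrail.
# All weights, thresholds, and names are assumptions for exposition,
# not the mechanism described in the paper.
from dataclasses import dataclass


@dataclass
class UserTrust:
    interaction_trust: float  # direct interaction trust, assumed in [0, 1]
    authority_trust: float    # authority-verified trust, assumed in [0, 1]

    def combined(self, alpha: float = 0.5) -> float:
        """Blend the two trust sources into one user trust metric (assumed weighting)."""
        return alpha * self.interaction_trust + (1.0 - alpha) * self.authority_trust


def moderate(query_sensitivity: float, trust: UserTrust, margin: float = 0.1) -> str:
    """Map user trust vs. query sensitivity to a moderation decision.

    query_sensitivity in [0, 1] stands in for the hazard level a context-aware
    knowledge base might assign to the topic; the thresholds are illustrative.
    """
    score = trust.combined()
    if score >= query_sensitivity + margin:
        return "allow"            # user trust comfortably exceeds the topic's sensitivity
    if score >= query_sensitivity - margin:
        return "allow_redacted"   # borderline: answer, but withhold sensitive details
    return "refuse"               # trust is too low for this content


if __name__ == "__main__":
    user = UserTrust(interaction_trust=0.7, authority_trust=0.9)
    print(moderate(query_sensitivity=0.6, trust=user))  # prints "allow" (0.8 >= 0.7)
```

A real deployment in the spirit of the paper would presumably update the interaction-trust term from the user's interaction history and calibrate sensitivity scores against the retrieved knowledge base, both of which are beyond this sketch.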
 
       
      
        Related papers
- Six Guidelines for Trustworthy, Ethical and Responsible Automation Design [0.6144680854063939]
Calibrated trust in automated systems is critical for their safe and seamless integration into society. We propose six design guidelines to help designers optimize for accurate trustworthiness assessments.
arXiv  Detail & Related papers  (2025-08-04T13:01:09Z) - LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale. We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv  Detail & Related papers  (2025-05-29T21:39:08Z) - Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery [17.23286832909591]
We propose a set of ten challenges in transmission of knowledge derived from the philosophical literature.
We find users develop workarounds to address each of the challenges.
For AI developers, the Epistemic Alignment Framework offers concrete guidance for supporting diverse approaches to knowledge.
arXiv  Detail & Related papers  (2025-04-01T21:38:12Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models.
REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values.
We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv  Detail & Related papers  (2025-03-20T07:54:35Z) - On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [333.9220561243189]
Generative Foundation Models (GenFMs) have emerged as transformative tools.
Their widespread adoption raises critical concerns regarding trustworthiness across dimensions.
This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv  Detail & Related papers  (2025-02-20T06:20:36Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv  Detail & Related papers  (2024-11-21T09:46:55Z) - A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection [0.0]
Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope.
Current guardrails suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production.
This paper introduces a flexible, data-free guardrail development methodology that addresses these challenges.
arXiv  Detail & Related papers  (2024-11-20T00:31:23Z) - Unveiling User Preferences: A Knowledge Graph and LLM-Driven Approach for Conversational Recommendation [55.5687800992432]
We propose a plug-and-play framework that synergizes Large Language Models (LLMs) and Knowledge Graphs (KGs) to unveil user preferences.
This enables the LLM to transform KG entities into concise natural language descriptions, allowing them to comprehend domain-specific knowledge.
arXiv  Detail & Related papers  (2024-11-16T11:47:21Z) - Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF).
Representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of LLM's intermediate hidden states.
It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv  Detail & Related papers  (2024-11-04T08:36:03Z) - CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants [5.7605009639020315]
We assess ten leading models across five scenarios (each with 337 use cases).
Key failure modes include inappropriate weighing of conflicting preferences, sycophancy, a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
We propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants.
arXiv  Detail & Related papers  (2024-10-28T15:59:31Z) - Trustworthy AI: Securing Sensitive Data in Large Language Models [0.0]
Large Language Models (LLMs) have transformed natural language processing (NLP) by enabling robust text generation and understanding.
This paper proposes a comprehensive framework for embedding trust mechanisms into LLMs to dynamically control the disclosure of sensitive information.
arXiv  Detail & Related papers  (2024-09-26T19:02:33Z) - Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv  Detail & Related papers  (2024-09-05T09:10:38Z) - Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations [48.07182711678573]
ASTrust generates explanations grounded in the relationship between model confidence and syntactic structures of programming languages.
We develop an automated visualization that illustrates the aggregated model confidence scores superimposed on sequence, heat-map, and graph-based visuals of syntactic structures from ASTs.
arXiv  Detail & Related papers  (2024-07-12T04:38:28Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv  Detail & Related papers  (2024-07-06T07:19:30Z) - RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations.
We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv  Detail & Related papers  (2023-11-28T14:55:52Z) - A Systematic Literature Review of User Trust in AI-Enabled Systems: An HCI Perspective [0.0]
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element to fostering adoption.
This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies.
arXiv  Detail & Related papers  (2023-04-18T07:58:09Z) - Designing for Responsible Trust in AI Systems: A Communication Perspective [56.80107647520364]
We draw from communication theories and literature on trust in technologies to develop a conceptual model called MATCH.
We highlight transparency and interaction as AI systems' affordances that present a wide range of trustworthiness cues to users.
We propose a checklist of requirements to help technology creators identify appropriate cues to use.
arXiv  Detail & Related papers  (2022-04-29T00:14:33Z) - Intent Contrastive Learning for Sequential Recommendation [86.54439927038968]
We introduce a latent variable to represent users' intents and learn the distribution function of the latent variable via clustering.
We propose to leverage the learned intents into SR models via contrastive SSL, which maximizes the agreement between a view of sequence and its corresponding intent.
Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm.
arXiv  Detail & Related papers  (2022-02-05T09:24:13Z) - Personalized multi-faceted trust modeling to determine trust links in social media and its potential for misinformation management [61.88858330222619]
We present an approach for predicting trust links between peers in social media.
We propose a data-driven multi-faceted trust modeling which incorporates many distinct features for a comprehensive analysis.
We evaluate the proposed framework on a trust-aware item recommendation task in the context of a large Yelp dataset.
arXiv  Detail & Related papers  (2021-11-11T19:40:51Z) - RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv  Detail & Related papers  (2021-07-07T15:42:49Z) 