Related papers: A Causal Explainable Guardrails for Large Language Models

URL: http://arxiv.org/abs/2405.04160v2
Date: Wed, 4 Sep 2024 13:29:56 GMT
Title: A Causal Explainable Guardrails for Large Language Models
Authors: Zhixuan Chu, Yan Wang, Longfei Li, Zhibo Wang, Zhan Qin, Kui Ren
Abstract summary: Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations.
Score: 29.441292837667415
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs toward desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardrail systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardrail's effectiveness in steering LLMs toward desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes.

Related papers

Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs [0.0]
This paper advances the study of anchoring in large language models (LLMs) through three contributions. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution signaling that the anchors influence reweighting. Findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains.
arXiv Detail & Related papers (2025-11-07T23:35:19Z)
From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation [3.75886080255807]
We introduce two innovative attack frameworks designed to generate dynamic and adaptive adversarial examples. We produce subtle and natural-looking adversarial inputs that preserve semantic similarity to the original text. Our attacks evolve with the advancements in LLMs and demonstrate strong transferability across models unknown to the attacker.
arXiv Detail & Related papers (2025-11-05T02:27:56Z)
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs [14.334903198382287]
It remains unclear whether large language models can produce outputs aligned with a broad variety of user goals. Interventions to improve steerability, such as prompt engineering, have varying effectiveness. Even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient.
arXiv Detail & Related papers (2025-05-27T21:29:52Z)
Steering LLMs for Formal Theorem Proving [0.29465623430708915]
Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics. We use activation steering to guide LLM responses and improve generations at inference time.
arXiv Detail & Related papers (2025-02-21T15:04:48Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models [30.685419129265252]
We bridge the divide between VLN-specialized models and LLM-based navigation paradigms. We exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning.
arXiv Detail & Related papers (2024-07-17T07:44:26Z)
UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation [12.04811490937078]
We investigate how feedforward neural networks (FFNs) and attention heads result in the bias of large language models (LLMs). To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads.
arXiv Detail & Related papers (2024-05-31T03:59:15Z)
FLAME: Factuality-Aware Alignment for Large Language Models [86.76336610282401]
The conventional alignment process fails to enhance the factual accuracy of large language models (LLMs). We identify factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). We propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization.
arXiv Detail & Related papers (2024-05-02T17:54:54Z)
The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs). We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework [20.753141804841]
Large language models (LLMs) can easily generate biased and discriminative responses. This paper focuses on social bias, tackling the association between demographic information and LLM outputs.
arXiv Detail & Related papers (2024-03-13T17:46:28Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate biases in Large Language Models (LLMs). Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
arXiv Detail & Related papers (2024-03-05T07:47:34Z)
Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions. A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations. Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons [42.66142331217763]
Personality plays a pivotal role in shaping human expression patterns. Previous methods relied on fine-tuning large language models (LLMs) on specific corpora. We have employed a novel Unsupervisedly-Built personalized lexicon (UBPL) in a pluggable manner to manipulate personality traits.
arXiv Detail & Related papers (2023-10-25T12:16:33Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.