Related papers: Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

URL: http://arxiv.org/abs/2411.02461v1
Date: Mon, 04 Nov 2024 08:36:03 GMT
Title: Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Authors: Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye,
Abstract summary: Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF) representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of LLM's intermediate hidden states. It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
Score: 44.326363467045496
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously. It proves difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature, restricting its practicality. In this work, we address this issue through ``Sparse Activation Control''. By delving into the intrinsic mechanisms of LLMs, we manage to identify and pinpoint components that are closely related to specific tasks within the model, i.e., attention heads. These heads display sparse characteristics that allow for near-independent control over different tasks. Our experiments, conducted on the open-source Llama series models, have yielded encouraging results. The models were able to align with human preferences on issues of safety, factuality, and bias concurrently.

Related papers

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models.<n>SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs.<n>Recent Mixture of Experts (MoE) approaches attempt to address this by SAEs into narrower expert networks with gated activation.<n>We propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs [10.99947795031516]
Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge.<n>We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits.<n>Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering.
arXiv Detail & Related papers (2025-10-29T05:56:39Z)
Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs [61.64185573373394]
We propose a training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal.<n>We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data.<n>Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
arXiv Detail & Related papers (2025-10-01T09:20:51Z)
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models.<n>We identify the presence of activation outliers, characterized by abnormally large activation values.<n>We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation [21.89080753903469]
We present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks.<n>Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information.<n>We propose Reinforced Reasoning for Personalization (model), a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs.
arXiv Detail & Related papers (2025-05-23T07:30:13Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [52.49062565901046]
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values.<n>Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences.<n>We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
arXiv Detail & Related papers (2025-03-10T09:52:42Z)
Activation Space Interventions Can Be Transferred Between Large Language Models [0.0]
We show that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts. We also propose a new task, textitcorrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor.
arXiv Detail & Related papers (2025-03-06T13:38:44Z)
LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
Unified Parameter-Efficient Unlearning for LLMs [25.195126838721492]
Large Language Models (LLMs) have revolutionized natural language processing, enabling advanced understanding and reasoning capabilities across a variety of tasks. This raises significant privacy and security concerns, as models may inadvertently retain and disseminate sensitive or undesirable information. We introduce a novel instance-wise unlearning framework, LLMEraser, which systematically categorizes unlearning tasks and applies precise adjustments using influence functions.
arXiv Detail & Related papers (2024-11-30T07:21:02Z)
Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z)
Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space [14.715989394285238]
Existing Large Language Models (LLMs) do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates. A new framework is proposed in this paper to address these issues. Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space.
arXiv Detail & Related papers (2024-05-22T17:13:49Z)
Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-ization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
Privacy Preserving Prompt Engineering: A Survey [14.402638881376419]
Pre-trained language models (PLMs) have demonstrated significant proficiency in solving a wide range of general natural language processing (NLP) tasks. As a result, the sizes of these models have notably expanded in recent years. Privacy concerns have become a major obstacle in its widespread usage.
arXiv Detail & Related papers (2024-04-09T04:11:25Z)
CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques. We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs) We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
DeAL: Decoding-time Alignment for Large Language Models [59.63643988872571]
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. We propose DeAL, a framework that allows the user to customize reward functions and enables Detime Alignment of LLMs. Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
arXiv Detail & Related papers (2024-02-05T06:12:29Z)
Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models [3.9490749767170636]
Large language models (LLMs) have revolutionized text generation, translation, and question-answering tasks. Despite their widespread use, LLMs present challenges such as ethical dilemmas when models are compelled to respond inappropriately. This paper addresses these challenges by introducing a multi-pronged approach that includes: 1) filtering sensitive vocabulary from user input to prevent unethical responses; 2) detecting role-playing to halt interactions that could lead to 'prison break' scenarios; and 4) extending these methodologies to various LLM derivatives like Multi-Model Large Language Models (MLLMs)
arXiv Detail & Related papers (2024-01-27T08:09:33Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Aligning Large Language Models with Human Preferences through Representation Engineering [41.81020951061438]
Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement.
arXiv Detail & Related papers (2023-12-26T11:01:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.